Abstract

We calculate resistance distances between papers 
in a nearly bipartite citation network of 492 papers 
and the sources cited by them. We validate that this 
is a realistic measure of thematic distance if each 
citation link has an electric resistance equal to the 
geometric mean of the number of the paper's refer- 
ences and the citation number of the cited source. 



1 Introduction 

It is often useful to be able to determine the thematic
similarity of two scholarly papers, which is
equivalent to knowing their thematic distance in a
hypothetical space of concepts or in the genealogical
tree of knowledge. Modelling a scholarly paper
by a set of terms and cited sources, we face the
problem that we cannot calculate the thematic
similarity of two papers which share neither
terms nor sources in their reference lists. In
bibliometric terms two such papers are neither
bibliographically nor lexically coupled, but it would not
be adequate to assume that they are totally unrelated
because both types of lists, that of references
and that of terms, are incomplete in general:

• papers do not cite all of their intellectual
ancestors,
• very general descriptors are often not included
in term lists.

In this paper we only discuss the case of citation 
networks of papers. Augmenting citation data with 
terms improves similarity estimation but we leave 
this opportunity for future work. 

One solution to the problem of incomplete reference
lists could be to search for all intellectual
ancestors, i.e. for indirect citation links between
papers in the past. This would be a tedious task based
on incomplete data because not all references of
references are indexed in citation databases. Here we
do not rely on indirect citation links in the past but
on indirect connections in networks of papers and
their cited sources in a time slice of one year.

Earlier we tested whether thematic distances be- 
tween papers could be estimated by the length of 
the shortest path between them in a one-year ci- 
tation network (Havemann et al. 2007). Shortest- 
path length was also used by Botafogo, Rivlin, and 
Shneiderman (1992) and by Egghe and Rousseau 
(2003a) to measure compactness of unweighted net- 
works. Egghe and Rousseau (2003b) generalised it 
to weighted graphs and applied it to small paper 
networks. A drawback of shortest-path length is 
its high sensitivity to the existence or nonexistence 
of single links which can act as shortcuts (Mitesser,
Heinz, Havemann, and Glaser 2008).

Here we propose and test another solution, which 
takes all or at least the most important possible 
paths between two nodes into account: we calculate
resistance distances between nodes (Klein and
Randic 1993, Tetali 1991). To the best of our
knowledge, resistance distance has not yet been used for
estimating the thematic similarity of papers.

We avoid time-consuming exact resistance computation
by using a fast approximate iteration
method applied by Wu and Huberman (2004). We
also discuss other iterative approaches to the
estimation of node similarity based on more than one
path (see sections 2 and 5).

In section 4 we validate that resistance is a real- 
istic measure of thematic distance if each citation 
link has an electric conductance equal to the inverse 
geometric mean of the number of the paper's refer- 
ences and the citation number of the cited source. 

2 Method 

We use the nearly bipartite citation network of pa- 
pers and their cited sources because projecting it 
on a one-mode graph of bibliographically coupled 
papers would reduce the information content of the 
data. The network is not fully bipartite because 
some papers are already cited in the year of their 
publication. 

In the electric model we assume that each link
has a conductance equal to its weight. We calculate
the effective resistance between two nodes as if they
were the poles of an electric power source.
Effective resistance has been proven to be a distance
measure fulfilling the triangle inequality (Klein
and Randic 1993, Tetali 1991).

One problem we have to solve before we can cal- 
culate distances is the delineation of the research 
field. It is not feasible and not necessary that the 
electric current between two papers flows through 
the total citation network of all papers published 
in the year considered. Field delineation should
be done by an appropriate method for finding
thematically coherent communities of papers (Fortunato
2010, Havemann, Glaser, Heinz, and Struck
2012b).

A second problem is the weighting of the network.
In bibliometrics, the strength of citation links is often
downgraded by dividing it by the number of
references of the citing paper. We weight
each link with the inverse geometric mean of its
two nodes' degrees (Havemann, Glaser, Heinz, and
Struck 2012a):

$$w_{ij} = \frac{1}{\sqrt{k_i k_j}} \qquad (1)$$

where $k_i$ is the degree of node $i$.
Then, for citation links, we take not only the number
of references into account but also the number
of citations the cited source receives from the papers
in the network. A citation link from a paper with
many references to a highly cited source is weaker
than a link from a paper with a short reference list
to a seldom cited source. 1
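
A minimal Python sketch of this weighting, assuming the citation data are available as (citing paper, cited source) pairs with each pair listed once. The function name `link_weights` and the input format are illustrative assumptions and are not taken from the original program.

```python
from collections import Counter
from math import sqrt


def link_weights(citations):
    """Weight each citation link with 1 / sqrt(k_paper * k_source).

    citations: iterable of (citing_paper, cited_source) pairs (assumed format).
    The degree of a node counts every link it takes part in, so a paper
    that is itself cited gets its references plus its citations.
    """
    links = list(citations)
    degree = Counter()
    for paper, source in links:
        degree[paper] += 1   # one more reference of the citing paper
        degree[source] += 1  # one more citation of the cited source
    return {
        (paper, source): 1.0 / sqrt(degree[paper] * degree[source])
        for paper, source in links
    }


# Hypothetical toy data: P1 cites A and B, P2 cites A.
toy_links = [("P1", "A"), ("P1", "B"), ("P2", "A")]
toy_weights = link_weights(toy_links)
```

In the toy data the link from P1 to the twice-cited source A comes out weaker than the link from P1 to the once-cited source B, which is the behaviour described above.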

A third problem is that an exact calculation of
all resistance distances between n nodes (e.g. papers
and cited sources in a field) requires the inversion
of an n × n matrix, a task of high complexity.
Fortunately, we are only interested in similarities
between papers and not between their many cited
sources. Furthermore, we need only approximations
of similarity rather than exact values. Therefore we
can use the fast approximate iteration method
applied by Wu and Huberman (2004) for community
finding. We describe its details in Appendix A.1.

This method is based on the fact that we know 
the effective resistance between two pole nodes in a 
network if we know the currents flowing from one of 
the two pole nodes to its neighbouring nodes. We 
can calculate these currents if we know the voltages 
of a pole's neighbours. From Kirchhoff's laws we 
know that — with the exception of the poles p and 
g — a node's voltage is the average of its neighbours' 
voltages, more precisely the weighted average with 
link conductances as weights. 

If we start with all voltages equal to zero (except
the positive pole's voltage $V_p = 1$) we obtain the
true voltages of all nodes by iteratively averaging
voltages according to the formula $V \leftarrow F(p,g)V$,
where V is the voltage vector and F(p,g) is the
row-normalised weighted adjacency matrix of the
network but with the pole nodes' row vectors filled
with zeros with the exception of $F_{pp} = 1$ (for details
cf. Appendix A.1).
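
As a small worked illustration (a hypothetical three-node chain, not taken from our data), consider poles p and g joined through one intermediate node a by two links of unit conductance. One averaging step gives

$$V_a \leftarrow \frac{w_{ap} V_p + w_{ag} V_g}{w_{ap} + w_{ag}} = \frac{1 \cdot 1 + 1 \cdot 0}{2} = \frac{1}{2},$$

which is already the fixed point of the iteration. The current leaving the positive pole and the resulting resistance are

$$I = w_{pa} (V_p - V_a) = \frac{1}{2}, \qquad R = \frac{U}{I} = 2,$$

as expected for two unit resistors in series.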

There are other iterative approaches to the
estimation of node similarity based on more than one
path (Leicht, Holme, and Newman 2006, Jeh and
Widom 2002). Their convergence can only be
assured by introducing a parameter < 1 for downgrading
the contributions of longer paths. Introducing
an auxiliary parameter should be avoided unless its
value could be estimated from theoretical considerations
or from empirical data (cf. section 5).



3 Experiment



We experimented with community-finding algorithms
on a connected citation network of 492
information science papers published in 2008
(Havemann et al. 2011). 2 In this sample we have
identified three topics by inspection of titles, keywords,
and abstracts (Havemann et al. 2012b). Therefore,
we also use it here to validate the measure of
thematic distance of scholarly papers.

1 Such a weighting follows the same reasoning as the
TF-IDF scheme in information retrieval.

Figure 1: Histogram of the logarithms of resistance
distances between all 120,786 pairs of 492 papers.
The distribution of R is skewed, but that of log(R)
rather symmetric.

The 492 papers cite 13,755 different sources and 
21 other papers in the sample. We analyse the 
nearly bipartite graph of papers and sources con- 
nected by 17,196 citation links. For the electric 
model we have to consider the graph to be undi- 
rected. We can drop all the 12,013 sources cited 
only once because no current can flow through their 
citation links. We cannot neglect the 15 papers 
cited only once. We weight the links according to
equation 1 where $k_i$ is the degree of node $i$ after
dropping the sources with only one citation.
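
The pruning step can be sketched in Python as follows. This is an illustration under assumptions: citation links are given as (citing paper, cited source) pairs, and a cited node counts as a paper of the sample if it also appears as a citing node. The function name `drop_single_cited_sources` is hypothetical.

```python
from collections import Counter


def drop_single_cited_sources(citations):
    """Remove links to sources cited only once and recompute node degrees.

    A source cited once is a dead end of the network, so no current can
    flow through its link. Papers of the sample are kept even if they are
    cited only once, because they have reference lists of their own.
    """
    links = list(citations)
    papers = {paper for paper, _ in links}  # assumption: citing nodes are the sample papers
    times_cited = Counter(source for _, source in links)
    kept = [
        (paper, source)
        for paper, source in links
        if times_cited[source] > 1 or source in papers
    ]
    degree = Counter()
    for paper, source in kept:
        degree[paper] += 1
        degree[source] += 1
    return kept, degree


# Hypothetical toy data: B is cited once and is dropped;
# P1 is a paper cited once (by P2) and is kept.
toy_links = [("P1", "A"), ("P1", "B"), ("P2", "A"), ("P2", "P1")]
kept_links, degrees = drop_single_cited_sources(toy_links)
```

The degrees returned here are the ones that would enter the link weights after pruning.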

The open-source C++ program (written by Andreas
Prescher) took about one hour to calculate
the 492 · 491/2 = 120,786 distances with a maximal
error of 0.1 (see Figure 1). If one only needs the
distribution of distances it can be approximated by
calculating distances of a random sample. Fewer than
one third of distances (36,590) are needed to obtain
an estimated standard error of the estimated population
mean smaller than 0.01 (see Appendix A.3).

2 Source of raw data: Web of Science.

Figure 2: Cumulated number of topic papers obtained
from ranking all 492 papers according to
their normalised median distance to papers of the
three topics. The black lines represent the ideal
cases, where all papers of a topic rank above other
papers.

4 Validation 

In earlier research we had identified three overlapping
topics in our network, named bibliometrics
(224 papers), Hirsch-index (42 papers), and webometrics
(24 papers). We validate the measure of
thematic distance by ranking all papers according
to the median distance to papers of a topic and
expect the papers dealing with this topic at top ranks.
Because we have not classified all papers dealing
with the topics considered, the ranking with
regard to thematic similarity cannot be perfect.

Another validation issue is that on average
resistance distances between high-degree nodes are
smaller than between low-degree nodes because all
currents must flow through the immediate neighbours
of the two nodes. The number of neighbours
of a paper is the number of its references.
More referenced sources suggest that the paper
deals with more topics, at least in the discussion
section. Thus, it is not an artifact of the measure
that papers with many references have smaller
distances to many other papers than papers with just
a few references. In other words, they are often the
central nodes in the graph.

Therefore we have to assure that the central 
nodes do not distort the ranking of nodes with re- 
gard to distances to a topic when we validate the 
measure. We correct for centrality by dividing the 
median distance of a paper to all topic papers by its 
median distance to all papers in the sample. The 
curves in Figure 2 show for the three predefined top- 
ics that indeed the topic papers have top ranks if 
we rank according to this ratio of medians. 
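
The ranking by the ratio of medians can be sketched in Python as follows. The sketch assumes a full symmetric distance matrix and excludes a paper's zero distance to itself from both medians; the source does not state how self-distances are handled, and the function names are illustrative.

```python
from statistics import median


def normalised_median_distance(dist, topic):
    """Ratio of a paper's median distance to the topic papers and its
    median distance to all papers (self-distance excluded, an assumption).

    dist: square symmetric matrix (list of lists) of resistance distances.
    topic: set of row indices of the topic papers (at least two papers).
    """
    n = len(dist)
    scores = {}
    for i in range(n):
        to_topic = [dist[i][j] for j in topic if j != i]
        to_all = [dist[i][j] for j in range(n) if j != i]
        scores[i] = median(to_topic) / median(to_all)
    return scores


def rank_by_topic(dist, topic):
    """Paper indices ordered from thematically closest to most distant."""
    scores = normalised_median_distance(dist, topic)
    return sorted(scores, key=scores.get)


# Hypothetical distances between four papers; papers 0 and 1 form the topic.
toy_dist = [
    [0, 1, 4, 5],
    [1, 0, 4, 6],
    [4, 4, 0, 2],
    [5, 6, 2, 0],
]
toy_ranking = rank_by_topic(toy_dist, {0, 1})
```

A small ratio means that a paper is closer to the topic than to the sample as a whole, so topic papers are expected at the top of the ranking.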

This result is confirmed by a further test. We 
have used the resistance distances as an input for 
hierarchical clustering of papers. Ward clustering 
reconstructs the three topics with values of preci- 
sion and recall similar to the values we obtained 
with hierarchical link clustering (Havemann et al. 
2012b). 



5 Discussion

There is another approach to node similarity which
also takes all possible paths between the nodes into
account and also leads to an iterative matrix
multiplication (Leicht et al. 2006, and references of this
paper). 3 It is based on a self-referential definition of
node similarity inspired by self-referential influence
definitions (Pinski and Narin 1976, Brin and Page
1998).

3 See the paper by Zhou, Lü, and Zhang (2009) for a
discussion of further measures of node similarity.

One advantage of the self-referential approach
compared to the iterative resistance calculation is
that one needs only one global iteration procedure
to obtain all node similarities in one run. The most
severe disadvantage we see is that the self-referential
iteration does not converge unless an auxiliary
multiplicative parameter < 1 is introduced which
diminishes the weight which longer paths are given in
the similarity measure.

Leicht et al. (2006) derive their iteration procedure
by relating the number of observed paths of
some length to the (approximated) number of
expected paths between the two nodes. Such a
relation to expectation is also necessary for the
resistance approach if differences between distances
have to be evaluated. A simple method is the one
we apply for validation of our measure. We relate
the observed to the median values of resistance
distances.

If we want to obtain a similarity or distance measure
which is comparable between different networks
we have to relate resistance distances between
nodes of a network to distances obtained in a null
model of the network. The null model depends on
the hypothesis we want to test with the measure of
node similarity.

Applying their approach to the case of any two
nodes i and j with distance 2, Leicht et al. (2006)
derive a similarity measure defined as the ratio of
the number of common neighbours to the product
$k_i k_j$ of their degrees (in contrast to the cosine
similarity where this number is related to the square
root of this product). If we estimate the current
between two nodes which have common neighbours
with the total current to the grounded pole g from
its neighbours after one iteration we get

$$I_g(1) = w_{pg} + \sum_{i \neq p} \frac{w_{gi} w_{ip}}{w_i}$$

with $w_i = \sum_j w_{ij}$ (cf. Appendix A.2). For the
network of a volume of papers and their cited sources
there are only a few papers linked by a direct
citation, i.e. the first term nearly always vanishes:
$w_{pg} = 0$. If the network is unweighted the similarity
(measured with inverse distance) of two papers
is then estimated by the sum of the inverse citation
numbers $k_i$ of the sources cited by both papers,

$$I_g(1) = \sum_i \frac{A_{gi} A_{ip}}{k_i},$$

where A denotes the adjacency matrix. This is
a reasonable new absolute measure of bibliographic
coupling where highly cited sources contribute less
to the coupling strength than sources cited only by
a few papers. With the weighting defined in
equation 1 we obtain another measure of bibliographic
coupling (cf. Appendix A.2):

$$I_g(1) = \frac{1}{\sqrt{k_p k_g}} \sum_i \frac{A_{gi} A_{ip}}{\sqrt{k_i} \sum_j A_{ij} / \sqrt{k_j}}.$$

Its denominator is equal to that of the cosine
similarity and the common sources in the sum are
weighted with the inverse product of the square root
of their citation numbers and the sum over their
citing papers weighted with the inverse square root
of their numbers of references.

We do not propose to use this expression as a
new similarity measure but argue that it is a
reasonable relative measure of bibliographic coupling
which downgrades the coupling strength of highly
cited sources and downgrades the contribution to
their citation numbers coming from papers citing
many other sources. This confirms the weighting
we use here.
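
The two one-iteration coupling estimates discussed in this section can be sketched in Python. The sketch assumes a purely bipartite toy network in which no paper is cited itself, so that a paper's degree equals its number of references, and it evaluates the first-iteration current into the grounded pole as the sum over common sources i of w_gi w_ip / w_i. The function name `coupling_estimates` and the data layout are illustrative assumptions.

```python
from math import sqrt


def coupling_estimates(references, p, g):
    """First-iteration current between papers p and g (no direct link).

    references: dict mapping each paper to the set of sources it cites.
    Returns (unweighted, weighted), where the weighted variant uses
    link weights 1 / sqrt(k_paper * k_source).
    """
    cited_by = {}
    for paper, sources in references.items():
        for source in sources:
            cited_by.setdefault(source, set()).add(paper)
    k = {paper: len(sources) for paper, sources in references.items()}

    unweighted = 0.0
    weighted = 0.0
    for source in references[p] & references[g]:
        k_s = len(cited_by[source])      # citation number of the source
        unweighted += 1.0 / k_s          # unit weights: w_gi * w_ip / w_i = 1 / k_i
        w = {paper: 1.0 / sqrt(k[paper] * k_s) for paper in cited_by[source]}
        weighted += w[p] * w[g] / sum(w.values())
    return unweighted, weighted


# Hypothetical toy data: three papers and three sources.
toy_refs = {"P1": {"A", "B"}, "P2": {"A", "B", "C"}, "P3": {"A", "C"}}
toy_unweighted, toy_weighted = coupling_estimates(toy_refs, "P1", "P2")
```

In the toy data the source A, cited by all three papers, contributes less to the coupling of P1 and P2 than the source B, which only these two papers cite.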



6 Summary 

We have validated that resistance distance calcu- 
lated in a citation graph is a realistic measure of 
thematic distance if each citation link has an elec- 
tric resistance equal to the geometric mean of the 
number of the paper's references and the citation 
number of the cited source. 



Acknowledgements 

This work is part of a project in which we develop 
methods for measuring the diversity of research. 
The project is funded by the German Ministry for 
Education and Research (BMBF). We thank Andreas
Prescher for developing the fast C++ program
for the algorithm.



A Appendix 

A.1 Resistance Distance

To calculate the total resistance between two nodes
we apply the fast approximate method described
by Wu and Huberman (2004).

To obtain the total resistance between any two
nodes p and g we set the voltage $V_p$ of the positive
pole p to 1 and the voltage $V_g$ of the grounded pole
g to zero. Thus we get the total tension
$U = V_p - V_g = 1$. If we know the total current I between the
two poles then we obtain the total resistance with
R = U/I = 1/I.

If we know the voltages $V_i$ of the positive pole's
adjacents i we obtain the total current I by summing
the currents

$$I_{pi} = U_{pi}/R_{pi} = (V_p - V_i)/R_{pi} = (1 - V_i)/R_{pi}$$

for all adjacents i.

Conductance $1/R_{ij}$ equals the link's weight $w_{ij}$.
We therefore get for the total current I between
nodes p and g

$$I = \sum_i I_{pi} = \sum_i w_{pi} (1 - V_i) = w_p - \sum_i w_{pi} V_i \qquad (2)$$

where $w_p = \sum_i w_{pi}$ is the weight of node p. We can
also calculate the total current from the currents
flowing into the grounded pole:

$$I = \sum_i I_{ig} = \sum_i w_{gi} V_i \qquad (3)$$

Each current $I_{ij}$ through link (i, j) equals the
voltage difference $U_{ij}$ of nodes i and j divided by
the link's resistance $R_{ij}$:

$$I_{ij} = U_{ij}/R_{ij} = (V_i - V_j)/R_{ij}.$$

From Kirchhoff's laws we know that the sum of currents
flowing out of a node i (which is not a voltage
source) to its adjacents j is zero: $\sum_j I_{ij} = 0$, that
means

$$\sum_j (V_i - V_j)/R_{ij} = V_i \sum_j w_{ij} - \sum_j w_{ij} V_j = 0.$$

From this we obtain that the voltage of node i is
the weighted average of its adjacents' voltages:

$$V_i = \frac{1}{w_i} \sum_j w_{ij} V_j \quad \text{with } w_i = \sum_j w_{ij} \qquad (4)$$

We obtain all the nodes' voltages by an iteration.
For this, we turn equation 4 into a command

$$V_i \leftarrow \frac{1}{w_i} \sum_j w_{ij} V_j,$$

that means, in each iteration step, we get the new 
voltage of a node by averaging the old voltages of 
the node's adjacents and expect that the algorithm 
converges. 
If we introduce the weight matrix W with row
sums normalised to one by

$$W_{ij} = \frac{w_{ij}}{\sum_l w_{il}}$$

we can write the iteration command as $V \leftarrow WV$.
Because the poles' voltages remain unchanged we
use a matrix F(p,g) instead of W. F(p,g) is the
row-normalised weighted adjacency matrix of the
network but with the pole nodes' row vectors filled
with zeros with the exception of $F_{pp}(p,g) = 1$.

We only need the voltages of the positive pole's
adjacents to obtain the total resistance between
nodes p and g as 1/I with equation 2. During the
iteration, we estimate these voltages. We consider
the series of estimated voltages and observe that
they cannot decrease. This means that the current
I estimated with equation 2 never increases and
the total resistance R = 1/I never decreases.
From equation 2 we thus obtain a lower bound of the
true total resistance. Analogously, from equation
3 we get an upper bound. Both bounds converge.
We stop the iteration if the difference between both
bounds becomes smaller than a small positive number
ε which acts as a measure of precision needed
for the analysis.
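
The iteration with its two bounds can be sketched in Python. This is a minimal illustration and not the C++ program used for the experiment: it assumes a connected network stored as a nested dictionary of conductances and updates all voltages simultaneously from the previous step. The name `resistance_distance` and the parameters `eps` and `max_iter` are illustrative.

```python
def resistance_distance(weights, p, g, eps=1e-6, max_iter=100_000):
    """Approximate the effective resistance between poles p and g.

    weights: dict node -> dict neighbour -> conductance (symmetric,
             connected network; both are assumptions of this sketch).
    Returns (estimate, lower_bound, upper_bound).
    """
    volt = {node: 0.0 for node in weights}
    volt[p] = 1.0                                # positive pole; g stays at 0
    for _ in range(max_iter):
        new_volt = {}
        for i, nbrs in weights.items():
            if i == p or i == g:
                new_volt[i] = volt[i]            # pole voltages are fixed
            else:                                # weighted average (equation 4)
                new_volt[i] = (
                    sum(w * volt[j] for j, w in nbrs.items()) / sum(nbrs.values())
                )
        volt = new_volt
        # Current leaving p (equation 2) is overestimated -> lower bound of R.
        current_out = sum(w * (1.0 - volt[j]) for j, w in weights[p].items())
        # Current reaching g (equation 3) is underestimated -> upper bound of R.
        current_in = sum(w * volt[j] for j, w in weights[g].items())
        lower = 1.0 / current_out
        upper = 1.0 / current_in if current_in > 0.0 else float("inf")
        if upper - lower < eps:
            return 0.5 * (lower + upper), lower, upper
    raise RuntimeError("bounds did not meet within max_iter iterations")


# Hypothetical chain p - a - b - g of unit conductances (three resistors in series).
toy_chain = {
    "p": {"a": 1.0},
    "a": {"p": 1.0, "b": 1.0},
    "b": {"a": 1.0, "g": 1.0},
    "g": {"b": 1.0},
}
toy_r, toy_lower, toy_upper = resistance_distance(toy_chain, "p", "g")
```

Returning the midpoint of the two bounds is a choice of this sketch; either bound alone lies within eps of the true value once the stopping rule is met.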

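A.2 First Approximation for Poles
with Common Neighbours

We start with voltages $V_i = 0$ for all $i \neq p$ and $V_p = 1$.
The first iteration results in voltages

$$V_i(1) = \frac{w_{ip}}{w_i} \quad (i \neq p, g).$$

The current reaching the grounded pole is then

$$I_g(1) = \sum_i w_{gi} V_i(1) = w_{pg} + \sum_{i \neq p} \frac{w_{gi} w_{ip}}{w_i} \qquad (6)$$

If positive pole p and grounded pole g have a graph
distance of two hops then $w_{pg} = 0$.
In the unweighted case $w_{ij} = A_{ij}$, $w_i = k_i$ and

$$I_g(1) = \sum_i \frac{A_{gi} A_{ip}}{k_i}.$$

If we weight according to equation 1 we obtain

$$I_g(1) = \frac{1}{\sqrt{k_p k_g}} \sum_i \frac{A_{gi} A_{ip}}{\sqrt{k_i} \sum_j A_{ij} / \sqrt{k_j}}.$$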
A.3 Distances of a Random Sample

If we do not need distances between all |P| papers
but only the form of their distribution we can avoid
calculating all N = |P|(|P| − 1)/2 distances. In this case,
we order all paper pairs randomly. Then the first n
distances are a random sample from all distances.
The standard error $S_{\bar R}$ of the average resistance $\bar R$
is then given by the square root of

$$S_{\bar R}^2 = \frac{N - n}{(N - 1) n} \cdot \frac{1}{n - 1} \left[ \sum_{i=1}^{n} R_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} R_i \right)^2 \right].$$

We stop calculating resistance distances if the standard
error $S_{\bar R}$ is smaller than ε/10 for the last ten
random samples. We can choose a relatively large ε for the
precision of each single resistance because the average
remains precise even if the averaged values are
rounded. Both sums in the formula can be updated
easily by adding the new terms to the last values of
the sums.
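
The running update of the two sums can be sketched in Python. The sketch assumes the usual finite-population correction (N − n)/((N − 1)n) applied to the sample variance with divisor n − 1; the class name `RunningStandardError` and its methods are illustrative and not part of the original program.

```python
from math import sqrt


class RunningStandardError:
    """Standard error of the mean of a growing sample drawn without
    replacement from a finite population of N distances."""

    def __init__(self, population_size):
        self.N = population_size
        self.n = 0
        self.sum_r = 0.0    # running sum of the distances
        self.sum_r2 = 0.0   # running sum of the squared distances

    def add(self, r):
        """Include one more sampled distance by updating both sums."""
        self.n += 1
        self.sum_r += r
        self.sum_r2 += r * r

    def mean(self):
        return self.sum_r / self.n

    def standard_error(self):
        n, N = self.n, self.N
        if n < 2:
            return float("inf")   # variance undefined for fewer than two values
        variance = (self.sum_r2 - self.sum_r ** 2 / n) / (n - 1)
        variance = max(variance, 0.0)   # guard against rounding below zero
        return sqrt((N - n) / ((N - 1) * n) * variance)


# Hypothetical use: four sampled distances out of a population of ten.
tracker = RunningStandardError(population_size=10)
for value in (1.0, 2.0, 3.0, 4.0):
    tracker.add(value)
toy_error = tracker.standard_error()
```

Updating sums of squares is convenient but can lose precision when the variance is small compared with the mean; a numerically more careful running variance could be substituted without changing the stopping rule.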



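The formula for $S_{\bar R}$ can be derived from

$$S_{\bar R}^2 = \frac{N - n}{(N - 1) n} S^2 \qquad (8)$$

where the variance of distances $S^2$ can be estimated
by the variance of the sample $s^2$ according to

$$S^2 = \frac{n}{n - 1} s^2 \qquad (9)$$

We have

$$s^2 = \frac{1}{n} \sum_{i=1}^{n} (R_i - \bar R)^2 = \frac{1}{n} \left[ \sum_{i=1}^{n} R_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} R_i \right)^2 \right]$$

leading to the formula for $S_{\bar R}$.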